Fix encoding of non-ascii field names in ignored source by jordan-powers · Pull Request #131950 · elastic/elasticsearch

jordan-powers · 2025-07-25T22:45:07Z

When encoding an ignored source entry, we use String.length() to get the length of the encoded field name. This only works when the field name contains only ASCII characters, which UTF-8 encodes as 1 byte per character.

The solution is to use the actual length of the encoded field name byte[] array.
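As a quick illustration of the mismatch, here is a minimal standalone sketch (the class and method names are hypothetical and not part of the Elasticsearch code base). It compares String.length(), which counts UTF-16 code units, against the length of the UTF-8 byte[] for the same name.

```java
import java.nio.charset.StandardCharsets;

public class NameLengthDemo {
    // Number of UTF-16 code units in the name; this is what String.length() returns.
    static int charLength(String name) {
        return name.length();
    }

    // Number of bytes in the UTF-8 encoding of the name.
    static int utf8Length(String name) {
        return name.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "field";          // ASCII only: 1 byte per character
        String accented = "caf\u00e9";   // U+00E9 takes 2 bytes in UTF-8
        String emoji = "\uD83D\uDE00";   // U+1F600: 2 UTF-16 code units, 4 UTF-8 bytes

        System.out.println(charLength(ascii) + " vs " + utf8Length(ascii));       // 5 vs 5
        System.out.println(charLength(accented) + " vs " + utf8Length(accented)); // 4 vs 5
        System.out.println(charLength(emoji) + " vs " + utf8Length(emoji));       // 2 vs 4
    }
}
```

The two values agree only for the ASCII name, which is why the bug stays hidden until a field name contains a non-ASCII character.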

I added a test that encodes a field in _ignored_source with a random unicode key. Without this fix, it fails with this stack trace:

java.lang.IllegalArgumentException: Can't decode [9c bd f2 8c b0 91 f1 80 ac 9c 53 f0 9e ad af f0 ae a2 b3 f1 9a 86 bd f1 a7 b5 86 f3 ac b0 82 e7 b5 b0 f1 9f 8f b3 f3 a7 b7 94 f2 91 9f bd f1 a0 80 af]
        at __randomizedtesting.SeedInfo.seed([1:80DEBC9A7DDCB0FE]:0)
        at org.elasticsearch.index.mapper.XContentDataHelper.decodeAndWrite(XContentDataHelper.java:109)
        at org.elasticsearch.index.mapper.XContentDataHelper.writeMerged(XContentDataHelper.java:196)
        at org.elasticsearch.index.mapper.ObjectMapper$SyntheticSourceFieldLoader$FieldWriter$IgnoredSource.writeTo(ObjectMapper.java:1230)
        at org.elasticsearch.index.mapper.ObjectMapper$SyntheticSourceFieldLoader.write(ObjectMapper.java:1141)
        at org.elasticsearch.index.mapper.SourceLoader$Synthetic$SyntheticLeaf.write(SourceLoader.java:240)
        at org.elasticsearch.index.mapper.SourceLoader$Synthetic$SyntheticLeaf.source(SourceLoader.java:199)
        at org.elasticsearch.index.mapper.SourceLoader$Synthetic$LeafWithMetrics.source(SourceLoader.java:162)
        at org.elasticsearch.search.lookup.ConcurrentSegmentSourceProvider$Leaf.getSource(ConcurrentSegmentSourceProvider.java:79)
        at org.elasticsearch.search.lookup.ConcurrentSegmentSourceProvider.getSource(ConcurrentSegmentSourceProvider.java:58)
        at org.elasticsearch.index.mapper.MapperServiceTestCase.syntheticSource(MapperServiceTestCase.java:862)
        at org.elasticsearch.index.mapper.MapperServiceTestCase.syntheticSource(MapperServiceTestCase.java:816)
        at org.elasticsearch.index.mapper.IgnoredSourceFieldMapperTests.getSyntheticSourceWithFieldLimit(IgnoredSourceFieldMapperTests.java:62)
        at org.elasticsearch.index.mapper.IgnoredSourceFieldMapperTests.getSyntheticSourceWithFieldLimit(IgnoredSourceFieldMapperTests.java:54)
        at org.elasticsearch.index.mapper.IgnoredSourceFieldMapperTests.testIgnoredStringFullUnicode(IgnoredSourceFieldMapperTests.java:134)
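
To show how a wrong length prefix can lead to a decode failure like this one, here is a simplified sketch of a length-prefixed layout. It is an assumption-laden model, not the actual IgnoredSourceFieldMapper code: the class, the method names, and the useByteLength flag are hypothetical, the layout is assumed to be a 4-byte little-endian name length followed by the name bytes and then the value bytes, and the parent offset that the real encoder packs into the same int is left out.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class IgnoredSourceLayoutSketch {
    // Assumed layout: [4-byte little-endian name length][UTF-8 name bytes][value bytes].
    // useByteLength=false mimics the bug by writing String.length() as the prefix.
    static byte[] encode(String name, byte[] value, boolean useByteLength) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        int declaredLength = useByteLength ? nameBytes.length : name.length();
        ByteBuffer buf = ByteBuffer.allocate(4 + nameBytes.length + value.length)
            .order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(declaredLength).put(nameBytes).put(value);
        return buf.array();
    }

    // Reads the declared name length from the first 4 bytes.
    static int nameSize(byte[] encoded) {
        return ByteBuffer.wrap(encoded).order(ByteOrder.LITTLE_ENDIAN).getInt(0);
    }

    // Decodes the name using the declared length, trusting the prefix.
    static String decodeName(byte[] encoded) {
        return new String(encoded, 4, nameSize(encoded), StandardCharsets.UTF_8);
    }

    // Everything after the declared name length is treated as the value.
    static byte[] decodeValue(byte[] encoded) {
        return Arrays.copyOfRange(encoded, 4 + nameSize(encoded), encoded.length);
    }
}
```

In this model, encoding the name "caf\u00e9" with the character count as the prefix makes the decoder stop one byte short of the end of the name, so the trailing name byte is handed to the value decoder as if it were the start of the value. That is consistent with the "Can't decode" message above, where the undecodable bytes look like UTF-8 continuation and lead bytes rather than an encoded value.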

elasticsearchmachine · 2025-07-25T22:45:32Z

Pinging @elastic/es-storage-engine (Team:StorageEngine)

elasticsearchmachine · 2025-07-25T22:45:32Z

Hi @jordan-powers, I've created a changelog YAML for you.

martijnvg

Can you confirm my thinking in the comment I left? Otherwise LGTM.

martijnvg · 2025-07-28T02:50:46Z

server/src/main/java/org/elasticsearch/index/mapper/IgnoredSourceFieldMapper.java

        byte[] nameBytes = values.name.getBytes(StandardCharsets.UTF_8);
        byte[] bytes = new byte[4 + nameBytes.length + values.value.length];
-        ByteUtils.writeIntLE(values.name.length() + PARENT_OFFSET_IN_NAME_OFFSET * values.parentOffset, bytes, 0);
+        ByteUtils.writeIntLE(nameBytes.length + PARENT_OFFSET_IN_NAME_OFFSET * values.parentOffset, bytes, 0);


Just double-checking: there is no need for an index version check here, given that decode isn't updated in this change. In other words, when new documents are indexed into indices with an older index version, the error described in the PR description will no longer occur.

jordan-powers · 2025-07-28T15:00:25Z

I realized that I can update the decode to handle the old format. This way we won't lose non-ascii data written in the old format. I've opened another PR: #132018

jordan-powers · 2025-07-28T16:47:14Z

Closed in favor of the solution in #132018

jordan-powers added 2 commits July 25, 2025 15:32

Add unicode test to IgnoredSourceFieldMapperTests

0a26484

Use proper byte length in IgnoredSourceFieldMapper#encode

4383d50

jordan-powers requested a review from kkrik-es July 25, 2025 22:45

jordan-powers self-assigned this Jul 25, 2025

jordan-powers added >bug :StorageEngine/Mapping The storage related side of mappings v9.2.0 labels Jul 25, 2025

elasticsearchmachine added the Team:StorageEngine label Jul 25, 2025

jordan-powers and others added 2 commits July 25, 2025 15:45

Update docs/changelog/131950.yaml

d389317

Remove sourcefilter tests

c635256

jordan-powers added v9.1.1 v8.19.1 v9.0.5 v8.18.5 auto-backport Automatically create backport pull requests when merged labels Jul 25, 2025

martijnvg approved these changes Jul 28, 2025


jordan-powers mentioned this pull request Jul 28, 2025

Fix decoding of non-ascii field names in ignored source #132018

Merged

jordan-powers closed this Jul 28, 2025

jordan-powers deleted the fix-utf-encoding-ignored-source branch July 28, 2025 16:54

jordan-powers wants to merge 4 commits into elastic:main from jordan-powers:fix-utf-encoding-ignored-source
